Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning

Authors

  • Barzan Mozafari
  • Purnamrita Sarkar
  • Michael J. Franklin
  • Michael I. Jordan
  • Samuel Madden
Abstract

Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrating machine learning into crowd-sourced databases in order to combine the accuracy of human labeling with the speed and cost-effectiveness of machine learning classifiers. By using active learning as our optimization strategy for labeling tasks in crowd-sourced databases, we can minimize the number of questions asked to the crowd, allowing crowd-sourced applications to scale (i.e., label much larger datasets at lower costs). Designing active learning algorithms for a crowd-sourced database poses many practical challenges: such algorithms need to be generic, scalable, and easy to use, even for practitioners who are not machine learning experts. We draw on the theory of nonparametric bootstrap to design, to the best of our knowledge, the first active learning algorithms that meet all these requirements. Our results, on 3 real-world datasets collected with Amazon's Mechanical Turk, and on 15 UCI datasets, show that our methods on average ask 1–2 orders of magnitude fewer questions than the baseline, and 4.5–44× fewer than existing active learning algorithms.
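To make the idea in the abstract concrete, here is a minimal Python sketch of bootstrap-based uncertainty sampling. It is an illustration under stated assumptions, not the algorithms from the paper: the nearest-centroid classifier, the function names (train_centroid, bootstrap_scores, select_for_crowd), the p * (1 - p) disagreement score, and the toy data are all hypothetical choices made here. The sketch resamples the labeled set with replacement, trains one classifier per resample, and sends to the crowd the unlabeled items on which those classifiers disagree most.

```python
import random


def train_centroid(xs, ys):
    """Hypothetical stand-in classifier: nearest class centroid on 1-D features."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))


def bootstrap_scores(labeled_x, labeled_y, unlabeled_x, n_boot=50, seed=0):
    """Score each unlabeled item by how much bootstrap classifiers disagree on it.

    Labels are assumed to be 0 or 1. The score is p * (1 - p), where p is the
    fraction of bootstrap classifiers that predict label 1.
    """
    rng = random.Random(seed)
    n = len(labeled_x)
    ones = [0] * len(unlabeled_x)
    for _ in range(n_boot):
        # Nonparametric bootstrap: resample the labeled set with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        predict = train_centroid([labeled_x[i] for i in idx],
                                 [labeled_y[i] for i in idx])
        for j, x in enumerate(unlabeled_x):
            ones[j] += predict(x)
    return [(k / n_boot) * (1 - k / n_boot) for k in ones]


def select_for_crowd(scores, budget):
    """Return indices of the `budget` highest-scoring (most uncertain) items."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:budget]


# Toy data (made up for illustration): two well-separated classes.
labeled_x = [i * 0.01 for i in range(20)] + [0.81 + i * 0.01 for i in range(20)]
labeled_y = [0] * 20 + [1] * 20
unlabeled_x = [0.05, 0.5, 0.95]

scores = bootstrap_scores(labeled_x, labeled_y, unlabeled_x)
to_ask = select_for_crowd(scores, budget=1)
```

In this toy setup, the item near the class boundary is the one the resampled classifiers disagree on, so it is the one worth paying the crowd to label; items far from the boundary can be left to the classifier. Because the score only needs a train-and-predict routine, any classifier could be swapped in for the centroid stand-in.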

Similar resources

Active Learning for Crowd-Sourced Databases

Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, or sentiment analysis. However, due to the time and cost of human labor, solutions that solely rely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integr...

Active Learning and Crowd-Sourcing for Machine Translation

In recent years, corpus based approaches to machine translation have become predominant, with Statistical Machine Translation (SMT) being the most actively progressing area. Success of these approaches depends on the availability of parallel corpora. In this paper we propose Active Crowd Translation (ACT), a new paradigm where active learning and crowd-sourcing come together to enable automatic...

Pronunciation learning for named-entities through crowd-sourcing

Obtaining good pronunciations for named-entities poses a challenge for automated speech recognition because named-entities are diverse in nature and origin, and new entities come up every day. In this paper, we investigate the feasibility of learning named-entity pronunciations using crowd-sourcing. By collecting audio samples from non-linguistic-expert speakers with Mechanical Turk and learning...

Probabilistic Zero-shot Classification with Semantic Rankings

In this paper we propose a non-metric ranking-based representation of semantic similarity that allows natural aggregation of semantic information from multiple heterogeneous sources. We apply the ranking-based representation to zero-shot learning problems, and present deterministic and probabilistic zero-shot classifiers which can be built from pre-trained classifiers without retraining. We demon...

Journal:
  • PVLDB

Volume: 8

Publication date: 2014